WB BERA Analysis

Yue Han, 04/01/2021

This notebook performs Exploratory Data Analysis on BERA's backfilled data, narrowed to WB's own brands. It analyzes and visualizes the four KPIs, the five Marketing Levers, and the Imagery (Brand Personality) attributes for each of our brands. It also includes my findings, along with some business insights and recommendations. Further analysis can be conducted after we join with the Neustar and MA data, and once we define our goal and have specific business questions to answer.

What is BERA

BERA is a brand management company that provides brand tracking in real time. It collects brand data by surveying 1M participants each year, covering 4,000 brands across 200 sectors in 87 markets globally, matched to census data. The survey is completed by invitation via mobile device or desktop browser, with cash incentives. The survey sample is population-balanced and market-defined, with a carefully structured volume and frequency system.
The syndicated metrics and measures include 4 KPIs, the 5P marketing levers, and 5 imagery categories with 42 adjectives.

The 4 KPIs are Familiarity, Meaningfulness, Regard, and Uniqueness. They are straightforward to understand: how familiar you are with a brand, how meaningful/relevant it is to you, how highly you regard it, and how competitively unique it is. The 5P marketing levers are Product (provides a product or service that meets my needs), Price (your willingness to pay a premium for it), Promotion (has advertising communications that are meaningful to me), Place (available when and where I want it, i.e. distribution), and People (has great people: service, ambassadors, users). Both the 4 KPIs and the 5Ps are scored on a 1-7 scale. For the imagery categories, participants were asked to select the adjectives they associate with the brand; a selected adjective is marked 1, otherwise 0.
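As a quick sanity check of the scales described above, here is a minimal sketch of the range checks implied by the survey design. The mini-sample is made up for illustration; it is not the real BERA data.

```python
import pandas as pd

# Hypothetical mini-sample mirroring the schema described above
sample = pd.DataFrame({
    "FAMILIARITY": [7, 4, 1],   # KPI, scored 1-7
    "PRICE": [5, 2, 7],         # marketing lever, scored 1-7
    "CHEERFUL": [1, 0, 0],      # imagery adjective, binary 0/1
})

# Range checks implied by the survey design
assert sample[["FAMILIARITY", "PRICE"]].isin(range(1, 8)).all().all()
assert sample["CHEERFUL"].isin([0, 1]).all()
```

The same checks could be run on the real `bera` frame to catch out-of-range or miscoded responses before analysis.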

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff
import seaborn as sns
from plotly.subplots import make_subplots
from wordcloud import WordCloud, ImageColorGenerator
from PIL import Image
from sklearn.cluster import KMeans
from sklearn import preprocessing
import squarify 
plt.style.use('ggplot')
sns.set_theme(style="ticks", color_codes=True)
In [2]:
bera = pd.read_csv("bera_backfilled.csv")
In [3]:
bera.head()
Out[3]:
RECORD_DATE AK_USER_ID AUDIENCE BRAND_ID BRAND_LABEL CATEGORY_ID CATEGORY_NAME CHARMING CHEERFUL CONFIDENT ... TOUGH TRENDY UNIQUE_ UNIQUENESS UPPERCLASS UPTODATE USECON WHOLESOME YOUNG METADATA_FILENAME
0 2020-07-13 -7536869195463930174 1 3212 Batman (character) 240 Entertainment Brands 0 0 0 ... 0 0 0 4 0 0 3 0 0 neustar/dsdk/events/2021/03/25/v2/site_activit...
1 2020-07-13 -7851963756735996159 1 3943 The Flash (character) 240 Entertainment Brands 1 1 1 ... 0 1 0 7 1 0 1 0 1 neustar/dsdk/events/2021/03/25/v2/site_activit...
2 2020-07-13 -3955948066284559267 1 3940 Superman (character) 240 Entertainment Brands 0 0 0 ... 0 0 0 3 1 0 6 0 1 neustar/dsdk/events/2021/03/25/v2/site_activit...
3 2020-07-13 8598148965403265438 1 3940 Superman (character) 240 Entertainment Brands 0 0 0 ... 0 0 0 1 0 0 6 0 0 neustar/dsdk/events/2021/03/25/v2/site_activit...
4 2020-07-13 3780639877935854143 1 6666 The Conjuring Series 240 Entertainment Brands 0 0 0 ... 0 0 0 4 0 0 2 0 0 neustar/dsdk/events/2021/03/25/v2/site_activit...

5 rows × 79 columns

In [4]:
bera.shape
Out[4]:
(25574, 79)
In [5]:
#check features and NA
bera.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25574 entries, 0 to 25573
Data columns (total 79 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   RECORD_DATE        25574 non-null  object 
 1   AK_USER_ID         25574 non-null  int64  
 2   AUDIENCE           25574 non-null  int64  
 3   BRAND_ID           25574 non-null  int64  
 4   BRAND_LABEL        25574 non-null  object 
 5   CATEGORY_ID        25574 non-null  int64  
 6   CATEGORY_NAME      25574 non-null  object 
 7   CHARMING           25574 non-null  int64  
 8   CHEERFUL           25574 non-null  int64  
 9   CONFIDENT          25574 non-null  int64  
 10  CONTEMPORARY       25574 non-null  int64  
 11  COOL               25574 non-null  int64  
 12  CORPORATE          25574 non-null  int64  
 13  CVS                24517 non-null  float64
 14  DARING             25574 non-null  int64  
 15  DATE_FILTER        25574 non-null  object 
 16  DEMO_AGE           25574 non-null  int64  
 17  DMA                25555 non-null  float64
 18  DOWNTOEARTH        25574 non-null  int64  
 19  EDUCATION          25574 non-null  int64  
 20  ETHNICITY          25574 non-null  int64  
 21  EXCITING           25574 non-null  int64  
 22  FAMILIARITY        25574 non-null  int64  
 23  FAMILYORIENTED     25574 non-null  int64  
 24  FEMININE           25574 non-null  int64  
 25  FILTER             25574 non-null  object 
 26  FRIENDLY           25574 non-null  int64  
 27  GENDER             25574 non-null  int64  
 28  GLAMOROUS          25574 non-null  int64  
 29  GOODLOOKING        25574 non-null  int64  
 30  HARDWORKING        25574 non-null  int64  
 31  HH_ASSETS          25574 non-null  int64  
 32  HH_CHILDREN        25574 non-null  int64  
 33  HH_LANGUAGE        25574 non-null  int64  
 34  HISPANIC           25574 non-null  int64  
 35  HONEST             25574 non-null  int64  
 36  ID                 25574 non-null  int64  
 37  IMAGINATIVE        25574 non-null  int64  
 38  INCOME             25574 non-null  int64  
 39  INDEPENDENT        25574 non-null  int64  
 40  INTELLIGENT        25574 non-null  int64  
 41  LEADER             25574 non-null  int64  
 42  LGBT               12904 non-null  float64
 43  MASCULINE          25574 non-null  int64  
 44  MEANINGFULNESS     25574 non-null  int64  
 45  MONTH              25574 non-null  object 
 46  NONEOFTHESE        25574 non-null  int64  
 47  NPS                12474 non-null  float64
 48  ORIGINAL           25574 non-null  int64  
 49  OUTDOORSY          25574 non-null  int64  
 50  PEOPLE             23876 non-null  float64
 51  PLACEMENT          24066 non-null  float64
 52  PRICE              24517 non-null  float64
 53  PRODUCT            23612 non-null  float64
 54  PROMOTION          24198 non-null  float64
 55  QUARTER            25574 non-null  object 
 56  REAL               25574 non-null  int64  
 57  REGARD             25574 non-null  int64  
 58  REGION_ROLLUP      25574 non-null  int64  
 59  RELIABLE           25574 non-null  int64  
 60  RUGGED             25574 non-null  int64  
 61  SECURE             25574 non-null  int64  
 62  SENTIMENTAL        25574 non-null  int64  
 63  SINCERE            25574 non-null  int64  
 64  SMALLTOWN          25574 non-null  int64  
 65  SMOOTH             25574 non-null  int64  
 66  SPIRITED           25574 non-null  int64  
 67  SUCCESSFUL         25574 non-null  int64  
 68  TECHNICAL          25574 non-null  int64  
 69  TOUGH              25574 non-null  int64  
 70  TRENDY             25574 non-null  int64  
 71  UNIQUE_            25574 non-null  int64  
 72  UNIQUENESS         25574 non-null  int64  
 73  UPPERCLASS         25574 non-null  int64  
 74  UPTODATE           25574 non-null  int64  
 75  USECON             25574 non-null  int64  
 76  WHOLESOME          25574 non-null  int64  
 77  YOUNG              25574 non-null  int64  
 78  METADATA_FILENAME  25574 non-null  object 
dtypes: float64(9), int64(62), object(8)
memory usage: 15.4+ MB

We noticed missing values in CVS, LGBT, NPS, PEOPLE, PLACEMENT, PRICE, PRODUCT, and PROMOTION. Since those attributes are not critical for now, we leave them as-is for the moment.
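If we later want to quantify how severe the missingness is, the per-column missing-value share is a one-liner. A toy frame stands in for `bera` here; in the notebook one would call the same expression on the real DataFrame.

```python
import pandas as pd
import numpy as np

# Toy stand-in for `bera`; on the real data: bera.isna().mean()
toy = pd.DataFrame({"PRICE": [1.0, np.nan, 3.0, np.nan],
                    "REGARD": [5, 6, 7, 4]})

# Fraction of missing values per column, worst first
na_share = toy.isna().mean().sort_values(ascending=False)
print(na_share)  # PRICE: 2/4 = 0.5 missing; REGARD: complete
```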

In [6]:
bera.AUDIENCE.value_counts()
Out[6]:
1    25574
Name: AUDIENCE, dtype: int64
In [7]:
bera.CATEGORY_NAME.value_counts()
Out[7]:
Entertainment Brands    25574
Name: CATEGORY_NAME, dtype: int64
In [8]:
bera.DATE_FILTER.value_counts()
Out[8]:
2020 week xxxxx    21251
2021 week xxxxx     4323
Name: DATE_FILTER, dtype: int64
In [9]:
bera.FILTER.value_counts()
Out[9]:
Total Population    25574
Name: FILTER, dtype: int64
In [10]:
bera.MONTH.value_counts()
Out[10]:
2020-09    4704
2020-08    3496
2020-11    3494
2020-12    3468
Jan-21     3460
2020-10    3455
2020-07    2634
Feb-21      863
Name: MONTH, dtype: int64

Remove the columns AUDIENCE, CATEGORY_ID, and CATEGORY_NAME, since each holds a single identical value, along with METADATA_FILENAME.

In [11]:
bera.drop(['AUDIENCE','CATEGORY_ID','CATEGORY_NAME', "METADATA_FILENAME"], 
          inplace=True, axis=1, errors='ignore')

What date range does the data span?

In [12]:
print(bera.RECORD_DATE.min())
print(bera.RECORD_DATE.max())
2020-07-13
2021-02-01
In [13]:
bera.RECORD_DATE.value_counts()
Out[13]:
2020-09-01    1206
2020-07-20     894
2020-09-07     886
2020-09-14     885
2020-12-07     881
2020-09-28     880
2020-08-24     880
2020-11-23     879
2020-08-17     878
2020-10-12     878
2020-12-21     877
2020-11-16     874
2020-11-02     873
2021-01-18     872
2020-07-27     872
2020-08-10     871
2020-07-13     868
2020-11-09     868
2020-08-03     867
2021-01-25     866
2021-01-11     866
2020-10-26     863
2021-02-01     863
2020-12-28     860
2020-10-19     857
2020-10-05     857
2021-01-04     856
2020-12-01     850
2020-09-21     847
Name: RECORD_DATE, dtype: int64

The BERA_March data spans 2020-07-13 to 2021-02-01.
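Note that `bera.info()` shows RECORD_DATE stored as strings (dtype `object`); comparing `min()`/`max()` works here only because ISO-formatted dates sort lexicographically. Parsing them with `pd.to_datetime` is safer and enables date arithmetic. A small sketch with toy dates taken from the range above:

```python
import pandas as pd

# Toy dates standing in for bera.RECORD_DATE (stored as strings)
dates = pd.Series(["2020-07-13", "2020-07-20", "2021-02-01"])
parsed = pd.to_datetime(dates)

# Span of the data in days, now computed on real timestamps
span = (parsed.max() - parsed.min()).days
print(span)  # 203 days between 2020-07-13 and 2021-02-01
```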

How many distinct WB brands are there?

In [14]:
bera.BRAND_LABEL.nunique()
Out[14]:
11

The Flintstones (characters) and Flintstones are the same brand, so let's replace Flintstones with The Flintstones (characters).

In [15]:
bera['BRAND_LABEL'] = bera['BRAND_LABEL'].replace(['Flintstones'], 'The Flintstones (characters)')
In [16]:
#check brands distribution
brands = bera.BRAND_LABEL.value_counts().to_frame().reset_index()
brands.columns = ['brand', 'frequency']
brands
Out[16]:
brand frequency
0 Batman (character) 2769
1 Looney Tunes (characters) 2767
2 Superman (character) 2765
3 Scooby Doo (character) 2765
4 The Flintstones (characters) 2749
5 Harry Potter (character) 2741
6 Wonder Woman (character) 2718
7 The Flash (character) 2445
8 Justice League (characters) 2308
9 The Conjuring Series 1547
In [17]:
brands["studio"] = "WB"
In [18]:
brands["category"] = ['Superhero', 'Animation', 'Superhero', 'Animation', 'Other Franchise', 
                    'Superhero', 'Superhero', 'Superhero', 'Animation', 'Other Franchise']
In [19]:
fig = px.treemap(brands,
                 path=['studio', 'category', 'brand'],
                 values='frequency')

fig.update_layout(title="WB Brands Composition",
                  width=1000, height=700,)

fig.show()
In [20]:
#plot brand distribution bar chart
fig = px.bar(brands, x="brand", y="frequency", title="Brands Distribution")
fig.update_layout(width=800, height=500)
fig.show()

From the distribution, we notice that the frequencies of the top 7 brands are approximately even, ranging from 2718 to 2769, while the last one, The Conjuring Series, has only 1547 records.

Four KPIs

Determine Overall Brand Performance & Momentum

Familiarity - how familiar you are with ...
Regard - how highly you regard ...
Meaningfulness - how meaningful/relevant ... is
Uniqueness - how competitively unique ... is

In [21]:
plt.style.use('ggplot')
plt.figure(figsize=(15, 6))
plt.hist([bera.FAMILIARITY,bera.REGARD, bera.MEANINGFULNESS, bera.UNIQUENESS],
        label=['FAMILIARITY', 'REGARD', 'MEANINGFULNESS', 'UNIQUENESS'], alpha = 0.8)
plt.legend(loc='upper left')
plt.suptitle("KPI Distribution", size=20)
plt.show()

We notice a significantly high frequency of score 7 for Familiarity, which indicates that our WB brands are overall very well known. I believe this is the biggest strength of our WB brands.

In [22]:
fig, axs = plt.subplots(2, 2, figsize=(12,8))
axs[0,0].hist(bera.FAMILIARITY)
axs[0,0].set_title("Familarity Distribution")
axs[0,1].hist(bera.REGARD)
axs[0,1].set_title("Regard Distribution")
axs[1,0].hist(bera.MEANINGFULNESS)
axs[1,0].set_title("Meaningfulness Distribution")
axs[1,1].hist(bera.UNIQUENESS)
axs[1,1].set_title("Uniqueness Distribution")
Out[22]:
Text(0.5, 1.0, 'Uniqueness Distribution')

Each of the KPI distributions is approximately left-skewed, so the median is higher than the mean.
This indicates our brands are loved by most people. However, one point worth attention: for Meaningfulness, unlike the other three KPIs, which peak at score 7, the most frequent scores are 4 and 5. Because uniqueness and meaningfulness drive the success of our business, I believe this is an area for improvement for our brands.
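The mean-vs-median claim can be checked numerically. A left-skewed toy series stands in for a KPI column here; on the real data one would call, e.g., `bera['REGARD'].skew()`.

```python
import pandas as pd

# Left-skewed toy scores (long tail toward low values), mimicking
# the KPI distributions above
scores = pd.Series([1, 4, 6, 6, 7, 7, 7])

print(scores.mean())    # ~5.43, pulled down by the low tail
print(scores.median())  # 6.0, greater than the mean
print(scores.skew())    # negative, confirming left skew
```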

In [23]:
kpis = bera[["FAMILIARITY", "REGARD", "MEANINGFULNESS", "UNIQUENESS"]]
kpis
Out[23]:
FAMILIARITY REGARD MEANINGFULNESS UNIQUENESS
0 6 5 4 4
1 7 7 7 7
2 7 1 2 3
3 2 1 1 1
4 5 4 4 4
... ... ... ... ...
25569 3 3 3 5
25570 6 7 6 1
25571 7 4 6 7
25572 7 4 2 4
25573 2 1 1 1

25574 rows × 4 columns

In [24]:
sns.set(rc={'figure.figsize':(11.7,8.27)})
#mask the redundant upper triangle (np.bool is deprecated; use the builtin bool)
mask = np.triu(np.ones_like(kpis.corr(), dtype=bool))
heatmap = sns.heatmap(kpis.corr(), mask=mask, vmin=-1, vmax=1, annot=True, cmap='BrBG')
heatmap.set_title('4 KPIs Correlation Heatmap', fontdict={'fontsize':18}, pad=12);

These 4 KPIs are all strongly positively correlated with one another. Regard and Meaningfulness have the strongest positive correlation: if people find a brand meaningful to them, they are more likely to hold it in high regard as well. Familiarity, however, is the least correlated with the other three. That is, if people are familiar with a brand, they are somewhat more likely to find it unique and meaningful and to regard it highly, but the relationship is not as strong as among the other KPIs.

Five Marketing Levers

Assess Awareness and performance of 5Ps

Product - provides a product or service that meets my needs
Price - willing to pay a premium for
Promotion - has advertising communications that are meaningful to me
Place - available when and where I want it (distribution)
People - has great people (service, ambassadors, users)

In [25]:
fig, axs = plt.subplots(2, 3, figsize=(12,8))
axs[0,0].hist(bera.PRODUCT, color='steelblue')
axs[0,0].set_title("Product Distribution")
axs[0,1].hist(bera.PLACEMENT, color='steelblue')
axs[0,1].set_title("Placement Distribution")
axs[0,2].hist(bera.PEOPLE, color='steelblue')
axs[0,2].set_title("People Distribution")

axs[1,0].hist(bera.PRICE, color='brown')
axs[1,0].set_title("Price Distribution")

axs[1,1].hist(bera.PROMOTION, color='brown')
axs[1,1].set_title("Promotion Distribution")

fig.delaxes(axs[1][2])

Price represents people's willingness to pay a premium for our brands, yet the most common response is a score of 1. Willingness to pay can depend on many factors, including the state of the economy, how trendy/in-season a product is, consumers' personal price points, circumstantial needs of different consumers, the rareness of a product, and the quality of a product. We should dig further into this after joining with the household information.
In terms of Promotion, the counts are highest at the two extremes, which suggests our advertising communications are meaningful to some people and simply uninteresting to others.

In [26]:
p5s = bera[["PRICE", "PROMOTION", "PRODUCT", "PLACEMENT", "PEOPLE"]]
p5s
Out[26]:
PRICE PROMOTION PRODUCT PLACEMENT PEOPLE
0 2.0 2.0 5.0 4.0 5.0
1 7.0 7.0 7.0 7.0 7.0
2 7.0 6.0 6.0 7.0 7.0
3 1.0 1.0 1.0 1.0 1.0
4 NaN NaN NaN 7.0 NaN
... ... ... ... ... ...
25569 1.0 1.0 NaN 1.0 NaN
25570 7.0 6.0 7.0 7.0 7.0
25571 5.0 7.0 7.0 7.0 7.0
25572 1.0 5.0 7.0 4.0 7.0
25573 1.0 1.0 1.0 1.0 1.0

25574 rows × 5 columns

In [27]:
sns.set(rc={'figure.figsize':(11.7,8.27)})
#mask the redundant upper triangle (np.bool is deprecated; use the builtin bool)
mask = np.triu(np.ones_like(p5s.corr(), dtype=bool))
heatmap = sns.heatmap(p5s.corr(), mask=mask, vmin=-1, vmax=1, annot=True, cmap='BrBG')
heatmap.set_title('5Ps Correlation Heatmap', fontdict={'fontsize':18}, pad=12);

All 5Ps are highly correlated with each other, especially Price and Promotion. We can conclude that people's willingness to pay depends strongly on whether they find the brand's advertising communications meaningful to them. So let's dig deeper to find out which groups of people find our brands' advertising communications meaningful, so we can focus on them in future marketing campaigns.

In [28]:
sns.set(rc={'figure.figsize':(11.7,8.27)})
demo_cols = ['PROMOTION', 'INCOME', 'DEMO_AGE', 'EDUCATION', 'HH_CHILDREN', 'GENDER', 'REGION_ROLLUP']
heatmap = sns.heatmap(bera[demo_cols].corr(), vmin=-1, vmax=1, annot=True, cmap='BrBG')
heatmap.set_title('Promotion & Demographic Correlation Heatmap', fontdict={'fontsize':18}, pad=12);

From the correlations, we find that income and number of children have a moderate positive correlation with Promotion, age and gender have a moderate negative correlation with Promotion, while education and region stay roughly neutral. So let's look at how people's income, age, gender, and number of children are distributed for both the highest and lowest Promotion scores.

In [29]:
low_promo = bera.loc[bera['PROMOTION'] == 1]
high_promo = bera.loc[bera['PROMOTION'] == 7]
In [30]:
#get the percentage of each income level for both low and high promotion
low_perc = low_promo.INCOME.value_counts()/low_promo.INCOME.count()
low_perc = low_perc.to_frame().reset_index()
low_perc.columns = ['income', 'no_meaningful_promotion']

high_perc = high_promo.INCOME.value_counts()/high_promo.INCOME.count()
high_perc = high_perc.to_frame().reset_index()
high_perc.columns = ['income', 'has_meaningful_promotion']

#join low and high promotion together
promo_perc = pd.merge(left = low_perc, right = high_perc, left_on = 'income', right_on = 'income')
promo_perc = promo_perc.sort_values("income")

promo_perc.plot(x="income", y=["no_meaningful_promotion", "has_meaningful_promotion"], kind = "bar", alpha=0.8)
plt.title("High Promotion vs. Low Promotion Income Distribution")
positions = (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
labels = ["Less than $10,000", "$10,000 to $14,999", "$15,000 to $24,999", "$25,000 to $34,999", 
          "$35,000 to $49,999", "$50,000 to $74,999", "$75,000 to $99,999", "$100,000 to $149,999", 
          "$150,000 to $199,999", "$200,000 to $499,999", "$500,000 to $1 million", "$1 million or more"]
plt.xticks(positions, labels, rotation = 30)
plt.show()
In [31]:
#get the percentage of each age level for both low and high promotion
low_perc = low_promo.DEMO_AGE.value_counts()/low_promo.DEMO_AGE.count()
low_perc = low_perc.to_frame().reset_index()
low_perc.columns = ['age', 'no_meaningful_promotion']

high_perc = high_promo.DEMO_AGE.value_counts()/high_promo.DEMO_AGE.count()
high_perc = high_perc.to_frame().reset_index()
high_perc.columns = ['age', 'has_meaningful_promotion']

#join low and high promotion together
price_perc = pd.merge(left = low_perc, right = high_perc, left_on = 'age', right_on = 'age')
price_perc = price_perc.sort_values("age")

price_perc.plot(x="age", y=["no_meaningful_promotion", "has_meaningful_promotion"], kind = "bar", alpha=0.8)
plt.title("High Promotion vs. Low Promotion Age Distribution")
positions = (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
labels = ["18-21", "22-24", "25-29", "30-34", "35-39", "40-44",
          "45-49", "50-54", "55-59", "60-64", "65 or Older"]
plt.xticks(positions, labels, rotation = 30)
plt.show()
In [32]:
#get the percentage for each number of children for both low and high promotion
low_perc = low_promo.HH_CHILDREN.value_counts()/low_promo.HH_CHILDREN.count()
low_perc = low_perc.to_frame().reset_index()
low_perc.columns = ['child_no', 'no_meaningful_promotion']

high_perc = high_promo.HH_CHILDREN.value_counts()/high_promo.HH_CHILDREN.count()
high_perc = high_perc.to_frame().reset_index()
high_perc.columns = ['child_no', 'has_meaningful_promotion']

#join low and high promotion together
price_perc = pd.merge(left = low_perc, right = high_perc, left_on = 'child_no', right_on = 'child_no')
price_perc = price_perc.sort_values("child_no")

price_perc.plot(x="child_no", y=["no_meaningful_promotion", "has_meaningful_promotion"], kind = "bar", alpha=0.8)
plt.title("High Promotion vs. Low Promotion Number of Children Distribution")
positions = (0, 1, 2, 3, 4)
labels = ["None", "1", "2", "3", "4 or more"]
plt.xticks(positions, labels, rotation = 0)
plt.show()
In [33]:
#get the percentage of each gender for both low and high promotion
low_perc = low_promo.GENDER.value_counts()/low_promo.GENDER.count()
low_perc = low_perc.to_frame().reset_index()
low_perc.columns = ['gender', 'no_meaningful_promotion']

high_perc = high_promo.GENDER.value_counts()/high_promo.GENDER.count()
high_perc = high_perc.to_frame().reset_index()
high_perc.columns = ['gender', 'has_meaningful_promotion']

#join low and high promotion together
price_perc = pd.merge(left = low_perc, right = high_perc, left_on = 'gender', right_on = 'gender')
price_perc = price_perc.sort_values("gender")

price_perc.plot(x="gender", y=["no_meaningful_promotion", "has_meaningful_promotion"], kind = "bar", alpha=0.8)
plt.title("High Promotion vs. Low Promotion Gender Distribution")
positions = (0, 1)
labels = ["Male", "Female"]
plt.xticks(positions, labels, rotation = 0)
plt.show()

Cluster by Income & Price

In [34]:
price_income = bera[['PRICE','INCOME']].dropna()
In [35]:
x1 = price_income.values
x1.shape
Out[35]:
(24517, 2)
In [36]:
wcss_1 = []
for i in range(1, 11):
    km = KMeans(n_clusters = i, init = 'k-means++', max_iter = 300, n_init = 10, random_state = 0)
    km.fit(x1)
    wcss_1.append(km.inertia_)
In [37]:
plt.plot(np.arange(1 , 11) , wcss_1 , 'o')
plt.plot(np.arange(1 , 11) , wcss_1 , '-' , alpha = 0.5)
plt.title('The Elbow Method', fontsize = 20)
plt.xlabel('No. of Clusters')
plt.ylabel('wcss_1')
plt.show()
In [38]:
kmeans_1=KMeans(n_clusters=4,init='k-means++',max_iter=300,n_init=10,random_state=0)
y_kmeans_1=kmeans_1.fit_predict(x1)
In [39]:
labels_1 = kmeans_1.labels_
centroids_1 = kmeans_1.cluster_centers_
In [40]:
labels_1
Out[40]:
array([2, 3, 1, ..., 3, 2, 2], dtype=int32)
In [41]:
centroids_1
Out[41]:
array([[   1.95552885, 1002.55378606],
       [   5.7007034 , 1007.38323564],
       [   1.67838235, 1006.42176471],
       [   5.72776924, 1003.51885987]])
In [42]:
#four clusters were fitted above, so plot four groups plus the centroids
plt.scatter(x1[y_kmeans_1 == 0, 0], x1[y_kmeans_1 == 0, 1], s = 100, c = 'pink', label = 'cluster 0')
plt.scatter(x1[y_kmeans_1 == 1, 0], x1[y_kmeans_1 == 1, 1], s = 100, c = 'yellow', label = 'cluster 1')
plt.scatter(x1[y_kmeans_1 == 2, 0], x1[y_kmeans_1 == 2, 1], s = 100, c = 'cyan', label = 'cluster 2')
plt.scatter(x1[y_kmeans_1 == 3, 0], x1[y_kmeans_1 == 3, 1], s = 100, c = 'magenta', label = 'cluster 3')
plt.scatter(centroids_1[:, 0], centroids_1[:, 1], s = 50, c = 'blue', label = 'centroid')

plt.style.use('fivethirtyeight')
plt.title('Clusters: Price vs Income', fontsize = 20)
plt.xlabel('Price')
plt.ylabel('Income')
plt.legend()
plt.grid()
plt.show()
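One caveat with the clustering above: k-means uses Euclidean distance, so features on very different scales (PRICE runs 1-7 while the INCOME codes sit near 1000) should generally be standardized first, or the wider-ranging feature dominates the distance. A minimal sketch with synthetic data, using scikit-learn modules already imported in this notebook:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Synthetic two-feature data on very different scales, standing in
# for bera[['PRICE', 'INCOME']] (INCOME here is a code near 1000)
rng = np.random.default_rng(0)
X = np.column_stack([rng.integers(1, 8, 200),          # PRICE, 1-7
                     rng.integers(1000, 1012, 200)])   # INCOME codes

# Standardizing puts both features on comparable scales so neither
# dominates the Euclidean distance used by k-means
X_std = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_std)
print(np.bincount(labels))  # cluster sizes
```

Re-running the elbow analysis on the standardized features would be the next step before trusting the cluster assignments.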

Now let's study the 4 KPIs and 5Ps for each WB brand.
See below for the mean of each KPI and lever per brand.

In [43]:
brand_kpi_mean = bera.groupby('BRAND_LABEL')[['FAMILIARITY', 'REGARD', 'UNIQUENESS', 'MEANINGFULNESS',
                                              'PRODUCT', 'PRICE', 'PROMOTION', 'PLACEMENT', 'PEOPLE']].mean()
brand_kpi_mean
Out[43]:
FAMILIARITY REGARD UNIQUENESS MEANINGFULNESS PRODUCT PRICE PROMOTION PLACEMENT PEOPLE
BRAND_LABEL
Batman (character) 5.804261 4.795955 4.830264 4.425063 4.735820 3.956101 4.297624 4.754260 4.888805
Harry Potter (character) 5.533382 4.651222 4.918278 4.326158 4.584421 3.939681 4.135218 4.717751 4.717434
Justice League (characters) 5.038562 4.479636 4.466638 4.181976 4.594684 4.020796 4.240536 4.715877 4.733647
Looney Tunes (characters) 5.872425 5.366823 5.300687 5.032526 5.271434 4.609474 4.796674 5.037023 5.115414
Scooby Doo (character) 5.940687 5.170344 5.197107 4.783002 5.058256 4.262110 4.552461 4.986191 5.140398
Superman (character) 5.831103 4.913562 4.802170 4.525497 4.959985 4.052017 4.586571 4.848566 5.233246
The Conjuring Series 4.915966 4.398836 4.391080 4.104719 4.274539 3.957103 3.956676 4.428876 3.974702
The Flash (character) 5.051943 4.267485 4.339468 3.899796 4.354943 3.633420 3.885765 4.300133 4.557071
The Flintstones (characters) 5.912696 5.142961 5.190251 4.761368 5.020368 4.321077 4.601380 4.813945 4.968677
Wonder Woman (character) 5.349522 4.639441 4.650110 4.305740 4.804907 3.947469 4.356011 4.554033 5.062280

The mean scores don't differ much across brands, mostly ranging from about 4 to 6.
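If we later want to collapse the table into a single ranking, one option (an illustrative choice, not something from the source) is an unweighted average of each brand's per-metric means. A two-brand toy frame stands in for `brand_kpi_mean`:

```python
import pandas as pd

# Toy stand-in for brand_kpi_mean (real values come from the groupby above)
toy = pd.DataFrame({"FAMILIARITY": [5.8, 5.0], "REGARD": [4.8, 4.4]},
                   index=["Batman (character)", "The Conjuring Series"])

# Unweighted composite score per brand, best first
ranking = toy.mean(axis=1).sort_values(ascending=False)
print(ranking)  # Batman (character) leads with (5.8 + 4.8) / 2 = 5.3
```

A weighted composite (e.g. weighting Meaningfulness and Uniqueness more, since they drive the business) would be a natural refinement.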

Let's visualize the KPIs for each brand with radar plots.

In [44]:
brand_lable = sorted(bera.BRAND_LABEL.unique())
In [45]:
fig = go.Figure()
for i in [3,4,8]:
    fig.add_trace(go.Scatterpolar(r=brand_kpi_mean.iloc[i].values,
                                theta=brand_kpi_mean.columns,
                                fill='toself',
                                name=brand_lable[i],
                                showlegend=True,)
                 )
fig.update_layout(polar=dict(radialaxis=dict(
                        visible=True,
                        range=[3.5, 6])),
                  title="Average KPIs & 5Ps of Animation Brands"
)

fig.show()

Among these three animation brands, although Scooby Doo has the highest Familiarity, people have the least willingness to pay for it.

Looney Tunes has the best well-rounded score overall, with an especially high PRODUCT. This indicates that Looney Tunes related products can be very popular and greatly meet consumers' needs.

In [58]:
fig = go.Figure()
for i in [0,5,9,7,2]:
    fig.add_trace(go.Scatterpolar(r=brand_kpi_mean.iloc[i].values,
                                theta=brand_kpi_mean.columns,
                                fill='toself',
                                name= brand_lable[i],
                                showlegend=True,)
                 )
fig.update_layout(polar=dict(radialaxis=dict(
                             visible=True,
                             range=[3.5, 6])
                            ),
                  title="Average KPIs & 5Ps of Super Heroes"
                 )
fig.show()

Among these Justice League brands, people's responses show that they are much more familiar with the individual heroes than with the Justice League as a whole.

As expected, Batman and Superman have the highest Familiarity. However, people show low willingness to pay for them, especially Superman.

Wonder Woman stands out on all 5 marketing levers, which suggests we have done a great job on the marketing and promotion of Wonder Woman. The data were recorded through this January, right when WW84 was released, so this result makes sense.

The Flash especially lags on PRICE; however, it does well on PRODUCT. I think the T-shirts and other merchandise with the Flash logo are very popular. I remember Sheldon has a Flash T-shirt he loves a lot.

In [47]:
fig = go.Figure()
for i in [1,6]:
    fig.add_trace(go.Scatterpolar(r=brand_kpi_mean.iloc[i].values,
                                theta=brand_kpi_mean.columns,
                                fill='toself',
                                name="KPI-mean-%s"%brand_lable[i],
                                showlegend=True,)
                )
fig.update_layout(polar=dict(radialaxis=dict(
                        visible=True,
                        range=[3.5, 6])
                        ),
                  title="Average KPIs & 5Ps of Other Franchise Brands"
)
fig.show()

For the franchise series, Harry Potter definitely performs best, but not as well as I expected compared with the previous brands. People are familiar with it but show very low willingness to pay for it. It also lags on Promotion; my guess is that we haven't run a Harry Potter campaign for a while, since the film series finished many years ago and people's attention has drifted elsewhere.

Comparatively, The Conjuring Series is not as successful; it especially lacks on PROMOTION and MEANINGFULNESS. I think we can do a better job promoting this brand, as well as improving the meaningfulness of the movies.

Brands Imagery

Characterize Type and Intensity of Associations

Brand imagery is divided into 5 categories: sincerity, excitement, competence, sophistication, and ruggedness.
Each of the 42 category-agnostic attributes is recorded as a binary 'associated'/'not associated' flag, with a 'none of these' option.

In [48]:
sincerity = bera[["BRAND_LABEL", "DOWNTOEARTH", "FAMILYORIENTED", "SMALLTOWN", "HONEST", "SENTIMENTAL",
                 "REAL", "WHOLESOME", "ORIGINAL", "CHEERFUL", "SINCERE", "FRIENDLY"]]
excitement = bera[["BRAND_LABEL", "DARING", "TRENDY", "EXCITING", "SPIRITED", "INDEPENDENT", "YOUNG",
                  "IMAGINATIVE", "UNIQUE_", "UPTODATE", "COOL", "CONTEMPORARY"]]
competence = bera[["BRAND_LABEL", "RELIABLE", "HARDWORKING", "SECURE", "INTELLIGENT", "TECHNICAL",
                  "CORPORATE", "SUCCESSFUL", "LEADER", "CONFIDENT"]]
sophistication = bera[["BRAND_LABEL", "UPPERCLASS", "GLAMOROUS", "GOODLOOKING", "CHARMING", "FEMININE", "SMOOTH"]]
ruggedness = bera[["BRAND_LABEL", "OUTDOORSY", "MASCULINE", "TOUGH", "RUGGED"]]
In [49]:
sincerity_sum = sincerity.drop('BRAND_LABEL', axis=1).sum().to_frame().reset_index()
sincerity_sum.columns = ['attribute', 'frequency']
sincerity_sum = sincerity_sum.sort_values('frequency', ascending=False)
#sincerity_sum

excitement_sum = excitement.drop('BRAND_LABEL', axis=1).sum().to_frame().reset_index()
excitement_sum.columns = ['attribute', 'frequency']
excitement_sum = excitement_sum.sort_values('frequency', ascending=False)
#excitement_sum

competence_sum = competence.drop('BRAND_LABEL', axis=1).sum().to_frame().reset_index()
competence_sum.columns = ['attribute', 'frequency']
competence_sum = competence_sum.sort_values('frequency', ascending=False)
#competence_sum

sophistication_sum = sophistication.drop('BRAND_LABEL', axis=1).sum().to_frame().reset_index()
sophistication_sum.columns = ['attribute', 'frequency']
sophistication_sum = sophistication_sum.sort_values('frequency', ascending=False)
#sophistication_sum

ruggedness_sum = ruggedness.drop('BRAND_LABEL', axis=1).sum().to_frame().reset_index()
ruggedness_sum.columns = ['attribute', 'frequency']
ruggedness_sum = ruggedness_sum.sort_values('frequency', ascending=False)
#ruggedness_sum
In [50]:
imagery = [sincerity_sum, excitement_sum, competence_sum, sophistication_sum, ruggedness_sum]
imagery_df = pd.concat(imagery)

Let's visualize the overall image of WB's brands with a word cloud.

In [51]:
d = {}
for a, x in imagery_df.values:
    d[a] = x
    
#wordcloud = WordCloud(background_color="white", mask = mask)
wordcloud = WordCloud(background_color="white")
wordcloud.generate_from_frequencies(frequencies=d)
mask = np.array(Image.open("WB.png"))
image_colors = ImageColorGenerator(mask)
plt.figure(figsize=[10,10])
plt.imshow(wordcloud.recolor(color_func=image_colors), interpolation="bilinear")
#plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")

plt.show()

Our brands stand out most with the image of family-oriented. For audience targeting, this may inform us to segment the audience toward families.
Now let's study the image of each of our brands.

In [52]:
imagery_all = [sincerity, excitement, competence, sophistication, ruggedness]
imagery_all_df = pd.concat(imagery_all)
imagery_all_sum = imagery_all_df.groupby('BRAND_LABEL').sum()
imagery_all_sum.drop(["OUTDOORSY", "MASCULINE", "TOUGH", "RUGGED"], 
                     inplace=True, axis=1, errors='ignore')
imagery_all_sum
Out[52]:
DOWNTOEARTH FAMILYORIENTED SMALLTOWN HONEST SENTIMENTAL REAL WHOLESOME ORIGINAL CHEERFUL SINCERE ... CORPORATE SUCCESSFUL LEADER CONFIDENT UPPERCLASS GLAMOROUS GOODLOOKING CHARMING FEMININE SMOOTH
BRAND_LABEL
Batman (character) 459.0 760.0 250.0 683.0 484.0 426.0 471.0 962.0 374.0 494.0 ... 401.0 1040.0 932.0 952.0 446.0 401.0 794.0 580.0 188.0 447.0
Harry Potter (character) 480.0 1119.0 312.0 580.0 528.0 403.0 628.0 968.0 561.0 546.0 ... 323.0 1127.0 701.0 683.0 305.0 351.0 566.0 719.0 217.0 342.0
Justice League (characters) 366.0 806.0 219.0 450.0 349.0 336.0 476.0 637.0 422.0 358.0 ... 378.0 758.0 550.0 657.0 251.0 398.0 572.0 406.0 263.0 317.0
Looney Tunes (characters) 662.0 1428.0 377.0 569.0 851.0 507.0 899.0 1147.0 1123.0 516.0 ... 434.0 1024.0 552.0 544.0 237.0 310.0 471.0 712.0 228.0 378.0
Scooby Doo (character) 846.0 1524.0 424.0 724.0 809.0 467.0 997.0 1082.0 1173.0 703.0 ... 353.0 915.0 481.0 594.0 258.0 337.0 619.0 949.0 234.0 422.0
Superman (character) 630.0 1030.0 543.0 895.0 562.0 393.0 837.0 937.0 563.0 730.0 ... 347.0 968.0 908.0 995.0 328.0 419.0 1100.0 842.0 194.0 466.0
The Conjuring Series 197.0 207.0 221.0 207.0 209.0 271.0 197.0 435.0 187.0 198.0 ... 216.0 398.0 232.0 279.0 183.0 172.0 208.0 209.0 180.0 212.0
The Flash (character) 431.0 735.0 281.0 574.0 356.0 357.0 488.0 689.0 480.0 444.0 ... 295.0 733.0 605.0 722.0 260.0 338.0 656.0 575.0 171.0 365.0
The Flintstones (characters) 785.0 1476.0 539.0 619.0 857.0 460.0 926.0 1106.0 1053.0 579.0 ... 290.0 860.0 391.0 454.0 221.0 270.0 419.0 776.0 271.0 307.0
Wonder Woman (character) 530.0 858.0 279.0 719.0 470.0 446.0 632.0 807.0 574.0 628.0 ... 327.0 975.0 836.0 1034.0 342.0 953.0 1160.0 749.0 1366.0 397.0

10 rows × 37 columns

Let's visualize the image of each brand by plotting the composition of its top 10 attributes.

In [53]:
for i in range(10):
    value = pd.DataFrame(imagery_all_sum.iloc[i].values)
    col = pd.DataFrame(imagery_all_sum.columns)
    df = pd.concat([col, value], axis=1)
    df.columns = ['attr', 'frequency']
    top_10 = df.sort_values(by='frequency', ascending=False).head(10)

    plt.figure(figsize=(8, 5))
    squarify.plot(sizes=top_10.frequency, label=top_10.attr, alpha=0.6)
    plt.title("Imagery of %s"%brand_lable[i], size = 15)
    plt.show()

The top 10 attributes of each brand make sense and correspond well to their features.